Contents: Unsupervised Learning Project

  1. Part-A: Solution
  2. Part-B: Solution

Part-A: Solution

1. Data Understanding & Exploration:

1A. Read ‘Car name.csv’ as a DataFrame and assign it to a variable.

1B. Read ‘Car-Attributes.json’ as a DataFrame and assign it to a variable.

1C. Merge both DataFrames into a single DataFrame.

1D. Print the 5-point summary of the numerical features and share insights.
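
A minimal sketch of steps 1A–1D with pandas. The tiny inline frames stand in for the two files (real code would use `pd.read_csv`/`pd.read_json` as commented), and joining by shared row order rather than a key column is an assumption:

```python
import pandas as pd

# In the project these come from the files, e.g.:
#   names = pd.read_csv("Car name.csv")
#   attrs = pd.read_json("Car-Attributes.json")
# Tiny stand-in frames so the sketch runs on its own:
names = pd.DataFrame({"car_name": ["chevrolet chevelle", "buick skylark"]})
attrs = pd.DataFrame({"mpg": [18.0, 15.0], "cyl": [8, 8], "wt": [3504, 3693]})

# Column-wise concatenation: the two sources share row order, not a join key.
df = pd.concat([names, attrs], axis=1)

# 5-point summary (min, 25%, 50%, 75%, max) of the numeric features.
summary = df.describe().loc[["min", "25%", "50%", "75%", "max"]]
print(summary)
```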

Observations:

2. Data Preparation & Analysis:

2A. Check and print feature-wise percentage of missing values present in the data and impute with the best suitable approach.
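
One way to sketch 2A, on a toy frame with a hypothetical gap in `hp`; median imputation is used here as the default that is robust to outliers (mean or model-based imputation are alternatives):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the merged car data.
df = pd.DataFrame({"hp": [130.0, np.nan, 150.0, 140.0],
                   "wt": [3504, 3693, 3436, 3433]})

# Feature-wise percentage of missing values.
missing_pct = df.isna().mean() * 100
print(missing_pct)   # hp: 25.0, wt: 0.0

# Median imputation is robust to skew and outliers.
df["hp"] = df["hp"].fillna(df["hp"].median())
```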

2B. Check for duplicate values in the data and impute with the best suitable approach.
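
For 2B, exact duplicate rows are usually dropped rather than imputed; a minimal sketch on made-up rows:

```python
import pandas as pd

df = pd.DataFrame({"mpg": [18.0, 18.0, 15.0], "cyl": [8, 8, 8]})

n_dupes = int(df.duplicated().sum())   # rows identical to an earlier row
df = df.drop_duplicates().reset_index(drop=True)
```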

2C. Plot a pairplot for all features.

Observations:

2D. Visualize a scatterplot for ‘wt’ and ‘disp’. Datapoints should be distinguishable by ‘cyl’.

2E. Share insights for Q2.D.

Observations:

2F. Visualize a scatterplot for ‘wt’ and ’mpg’. Datapoints should be distinguishable by ‘cyl’.
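
Both scatterplots (2D and 2F) follow the same pattern using seaborn's `hue` to distinguish points by `cyl`; a sketch with made-up rows:

```python
import matplotlib
matplotlib.use("Agg")   # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.DataFrame({"wt": [3504, 2130, 3693, 2264],
                   "mpg": [18.0, 30.5, 15.0, 26.0],
                   "cyl": [8, 4, 8, 4]})

# Points colored by cylinder count; swap y="mpg" for y="disp" to get 2D.
ax = sns.scatterplot(data=df, x="wt", y="mpg", hue="cyl", palette="deep")
plt.savefig("wt_vs_mpg.png")
```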

2G. Share insights for Q2.F.

Observations:

2H. Check for unexpected values in all the features and datapoints with such values.

[Hint: ‘?’ is present in ‘hp’]
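
A sketch of the usual fix for the hint above: coerce the ‘?’ placeholders to NaN so `hp` becomes numeric and can then be imputed like any other missing value:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"hp": ["130", "?", "150"]})   # '?' makes the dtype object

df["hp"] = pd.to_numeric(df["hp"].replace("?", np.nan))
n_bad = int(df["hp"].isna().sum())               # unexpected values found
```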

Quick EDA

Correlation Heatmap

Distribution and Outliers

Remove Outliers

Normalize/Standardize the data with the best suitable approach.
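
Since K-Means is distance-based, z-score standardization is a common choice here; a sketch with scikit-learn on stand-in values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[3504.0, 18.0],
              [2130.0, 30.5],
              [3693.0, 15.0]])

X_scaled = StandardScaler().fit_transform(X)
# Each column now has mean ~0 and unit standard deviation.
```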

3. Clustering with all Features:

3A. Apply K-Means clustering for 2 to 10 clusters.

3B. Plot a visual and find elbow point.
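
A sketch of 3A–3B on synthetic blobs (the real run would use the scaled car features): fit K-Means for k = 2…10 and record the inertia that the elbow plot is drawn from:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=4, random_state=42)

inertias = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)
# Plot k vs. inertia; the "elbow" is where the curve's drop flattens out.
```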

Let's plot the silhouette score as a function of K:

3C. On the above visual, highlight which are the possible Elbow points.

Another approach is to look at the silhouette score, which is the mean silhouette coefficient over all instances. An instance's silhouette coefficient equals (b − a) / max(a, b), where a is the mean distance to the other instances in the same cluster (the mean intra-cluster distance) and b is the mean nearest-cluster distance, i.e. the mean distance to the instances of the next closest cluster (defined as the one that minimizes b, excluding the instance's own cluster). The silhouette coefficient ranges from −1 to +1: a value close to +1 means the instance is well inside its own cluster and far from other clusters, a value close to 0 means it lies near a cluster boundary, and a value close to −1 means the instance may have been assigned to the wrong cluster.
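
The description above maps directly onto scikit-learn's `silhouette_score` (synthetic blobs stand in for the car data):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=200, centers=4, random_state=42)

scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # mean coefficient over instances

best_k = max(scores, key=scores.get)         # k with the highest mean silhouette
```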

3D. Train a K-means clustering model once again on the optimal number of clusters.

3E. Add a new feature in the DataFrame which will have labels based upon cluster value.

3F. Plot a visual and color the datapoints based upon clusters.

3G. Pass a new DataPoint and predict which cluster it belongs to.
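
Steps 3D–3G together, sketched on synthetic data; the cluster count (3) and feature names are placeholders for whatever the elbow/silhouette analysis selects:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=42)
df = pd.DataFrame(X, columns=["f1", "f2"])

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
df["cluster"] = km.labels_                   # 3E: label column from the fit

new_point = np.array([[0.0, 0.0]])
cluster_id = int(km.predict(new_point)[0])   # 3G: cluster of a new datapoint
```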

Improving the quality of the clusters in the next section...

4. Clustering with reduced Features:

3A. Apply K-Means clustering for 2 to 10 clusters.

With a reduced set of features, we get lower cluster errors and higher silhouette scores.

3B. Plot a visual and find elbow point.

Let's plot the silhouette score as a function of K:

3C. On the above visual, highlight which are the possible Elbow points.

3D. Train a K-means clustering model once again on the optimal number of clusters.

3E. Add a new feature in the DataFrame which will have labels based upon cluster value.

3F. Plot a visual and color the datapoints based upon clusters.

3G. Pass a new DataPoint and predict which cluster it belongs to.

We can run a similar analysis with 2 groups as well, to check whether we get a clearer distinction among the groups.

Part-B: Solution

1. Data Understanding and Cleaning:

1A. Read ‘vehicle.csv’ and save as DataFrame.

1B. Check percentage of missing values and impute with correct approach.

1C. Visualize a Pie-chart and print percentage of values for variable ‘class’.
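
A sketch of 1C with made-up counts (the real frequencies come from the `class` column of `vehicle.csv`):

```python
import matplotlib
matplotlib.use("Agg")   # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical class counts standing in for df["class"].
s = pd.Series(["car"] * 400 + ["bus"] * 220 + ["van"] * 200, name="class")

pct = s.value_counts(normalize=True) * 100
print(pct.round(2))

pct.plot.pie(autopct="%.1f%%", ylabel="")
plt.savefig("class_pie.png")
```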

Insights:

There is a big imbalance in the target vector.

If the imbalance is not treated beforehand, it will degrade the performance of the ML model: most predictions will correspond to the majority class, while the minority class gets treated as noise and effectively ignored. This results in a biased model with poor performance on the minority class.

A widely adopted technique for dealing with highly unbalanced datasets is called re-sampling.

Two widely used re-sampling methods are:

  1. Over-sampling: replicate (or synthesize, e.g. with SMOTE) minority-class samples until the classes are balanced.
  2. Under-sampling: discard majority-class samples until the class counts match.
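
A plain-pandas sketch of both re-sampling ideas, over-sampling the minority class and under-sampling the majority class (libraries such as imbalanced-learn provide the same and more, e.g. SMOTE):

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced frame: 8 majority (y=0) vs. 2 minority (y=1) rows.
df = pd.DataFrame({"x": range(10), "y": [0] * 8 + [1] * 2})
majority, minority = df[df["y"] == 0], df[df["y"] == 1]

# Over-sampling: replicate minority rows up to the majority count.
over = pd.concat([majority, resample(minority, replace=True,
                                     n_samples=len(majority), random_state=42)])

# Under-sampling: keep only as many majority rows as there are minority rows.
under = pd.concat([resample(majority, replace=False,
                            n_samples=len(minority), random_state=42), minority])
```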

1D. Check for duplicate rows in the data and impute with correct approach.

Quick EDA

Pairplot

Correlation Heatmap

Distribution and Outliers

Remove Outliers

2. Data Preparation:

2A. Split data into X and Y. [Train and Test optional]

2B. Standardize the Data.

3. Model Building:

3A. Train a base Classification model using SVM.

3B. Print Classification metrics for train data.
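
Steps 3A–3B sketched on synthetic data (the real `X`, `y` come from the standardized vehicle frame):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           random_state=42)
X = StandardScaler().fit_transform(X)

svm = SVC(kernel="rbf").fit(X, y)                # base model, default parameters
print(classification_report(y, svm.predict(X)))  # 3B: train-set metrics
```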

3C. Apply PCA on the data with 10 components.

3D. Visualize Cumulative Variance Explained with Number of Components.

3E. Draw a horizontal line on the above plot to highlight the threshold of 90%.

3F. Apply PCA on the data. This time Select Minimum Components with 90% or above variance explained.
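
Steps 3C–3F in one sketch: scikit-learn's `PCA` accepts a float `n_components` to keep the fewest components reaching that variance threshold (synthetic data stands in for the vehicle features):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = make_classification(n_samples=300, n_features=10, n_informative=6,
                           random_state=42)
X = StandardScaler().fit_transform(X)

# 3C-3D: all 10 components, for the cumulative-variance curve (plot it with
# plt.plot(cumvar) and plt.axhline(0.90) to get 3E's threshold line).
pca_full = PCA(n_components=10).fit(X)
cumvar = np.cumsum(pca_full.explained_variance_ratio_)

# 3F: fewest components explaining >= 90% of the variance.
pca_90 = PCA(n_components=0.90).fit(X)
X_reduced = pca_90.transform(X)
```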

3G. Train SVM model on components selected from above step.

3H. Print Classification metrics for train data of above model and share insights.

Evaluation metrics let us estimate errors and judge how well our models are performing:

Accuracy: the ratio of correct predictions to total predictions.

Precision: of the instances predicted positive, the fraction that are actually positive.

Recall: of the actual positive instances, the fraction the classifier correctly identifies.

F-Score: the harmonic mean of precision and recall, combining both into a single measure.
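
A worked example of the four metrics on a made-up prediction vector (here TP=2, FP=1, FN=1, TN=4):

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]   # misses one positive, one false alarm

acc = accuracy_score(y_true, y_pred)     # (TP+TN)/total = 6/8 = 0.75
prec = precision_score(y_true, y_pred)   # TP/(TP+FP) = 2/3
rec = recall_score(y_true, y_pred)       # TP/(TP+FN) = 2/3
f1 = f1_score(y_true, y_pred)            # 2*prec*rec/(prec+rec) = 2/3
```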

Insights:

4. Performance Improvement:

4A. Train another SVM on the components out of PCA. Tune the parameters to improve performance.

Use SVM without Oversampling

Use SVM with Oversampling

Use automated search without Oversampling for hyper-parameters.

Use automated search with Oversampling for hyper-parameters.
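
The "automated search" runs above can be sketched with `GridSearchCV`; the parameter grid and synthetic data are placeholders, and the oversampling variants would wrap the estimator in an imbalanced-learn pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=8, random_state=42)

param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.1], "kernel": ["rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=3).fit(X, y)

print(search.best_params_)   # 4B: best parameters observed
print(search.best_score_)    # mean cross-validated score of that setting
```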

4B. Share best Parameters observed from above step.

4C. Print Classification metrics for train data of above model and share relative improvement in performance in all the models along with insights.

Insights:

5. Data Understanding & Cleaning:

5A. Explain pre-requisite/assumptions of PCA.

There are some assumptions in PCA which should hold for this dimensionality-reduction technique to work well. The assumptions in PCA are:

  1. The features are numeric and continuous, and should be standardized, since PCA is sensitive to scale.
  2. Relationships among the variables are approximately linear; PCA captures only linear correlation.
  3. Directions of high variance carry the important information, so low-variance components can be discarded.
  4. The features are correlated to some degree; otherwise there is little redundancy for PCA to compress.
  5. The data contains no extreme outliers, which would distort the principal components.

5B. Explain advantages and limitations of PCA.

PCA offers multiple benefits, but it also suffers from certain shortcomings:

Advantages of PCA:

  1. Reduces dimensionality, which speeds up training and lowers storage and compute costs.
  2. Removes multicollinearity: the principal components are mutually uncorrelated.
  3. Can reduce overfitting by discarding low-variance noise directions.
  4. Enables visualization of high-dimensional data in two or three components.

Disadvantages of PCA:

  1. The components are linear combinations of the original features and are hard to interpret.
  2. Results depend on feature scaling, so standardization is effectively required.
  3. Maximizing variance is not the same as maximizing class separability, so discriminative information can be lost.
  4. Captures only linear structure in the data.
